CLAIMS 

Having thus described our invention, what we claim as new and desire to 
secure by Letters Patent is as follows: 

1. A computer-implemented method of executing a matrix operation, said method 
comprising: 

for a matrix A, separating said matrix A into blocks, each said block 
having a size p-by-q; and 
at least one of: 

storing elements in at least one of said blocks in at least one of a 
cache and a memory in a format in which elements of said block occupy a 
location different from an original location in said block; and 

storing said blocks of size p-by-q in said at least one of cache and 
memory in a format in which at least one said block occupies a position different 
relative to its original position in said matrix A. 

2. The method of claim 1, further comprising: 

loading said blocks from said memory into a first series of data registers so 
that a format of data in said data registers comprises variations of an optimal 
floating point loading instruction of a floating point unit (FPU) of the computer. 

3. The method of claim 1, wherein said different position comprises one of a 
column major position and a row major position. 

4. The method of claim 1, wherein said size p-by-q comprises a 2-by-2 block. 

5. The method of claim 2, wherein said variations of an optimal floating point 
loading instruction comprise a crisscrossing of elements about a diagonal of said blocks. 

6. The method of claim 2, wherein said matrix operation comprises a linear 
algebra operation, said method further comprising: 

storing a result of said linear algebra operation into one of a second set of 
data registers and a cache memory unit, said storing a result being dictated by an 
optimal FPU storage instruction. 

7. The method of claim 2, wherein said variations of an optimal floating point 
loading instruction in combination with said storing said blocks in a different 
position provides a result that a transpose of said matrix A resides in said data 
registers of said FPU. 

8. The method of claim 2, wherein said loading comprises a checkerboard 
technique. 

9. The method of claim 6, wherein said linear algebra operation comprises an 
LAPACK BLAS kernel. 

10. An apparatus, comprising: 

a reader to read a data of a matrix A; 

a separator to separate said data into blocks of a size p-by-q; 

a calculator to calculate a position of at least one said block that differs 
from an original position in said matrix A; and 

a memory loader to store said blocks into one of a cache and a memory, 
the different positioning of blocks comprising a register block data format, said 
register block data format being a format in which at least one of: 

elements in at least one of said blocks are stored in at least one of a 
cache and a memory in a format in which elements of said block occupy a 
location different from an original location in said block; and 

said blocks of size p-by-q are stored in said at least one of cache 
and memory in a format in which at least one said block occupies a position 
different relative to its original position in said matrix A. 

11. The apparatus of claim 10, further comprising: 

a first set of data registers; and 

a data register loader to load said blocks in said register block data format 
into said first set of data registers such that a format of data in said first set of data 
registers comprises a transpose of said matrix A. 

12. The apparatus of claim 11, further comprising: 

a calculator to execute a linear algebra processing on said data in said first 
set of data registers. 

13. A data structure in a computer program executing a matrix operation, said 
data structure comprising: 

for a matrix A, separating said matrix A into blocks, each said block 
having a size p-by-q; and 
at least one of: 

storing elements in at least one of said blocks in at least one of a 
cache and a memory in a format in which elements of said block occupy a 
location different from an original location in said block; and 

storing said blocks of size p-by-q in a format in which at least one 
said block occupies a position different from its original position in matrix A. 

14. The data structure of claim 13, wherein said size is 2-by-2. 

15. The data structure of claim 13, wherein said different position comprises a 
normal/transpose relationship of said block positions. 

16. The data structure of claim 13, wherein said matrix operation comprises an 
LAPACK BLAS kernel. 

17. A signal-bearing medium tangibly embodying a program of 
machine-readable instructions executable by a digital processing apparatus to 
perform a method of storing information of a matrix in a register block data 
format, said method comprising: 

receiving data for a matrix A; 

dividing said matrix A data into blocks, each said block having a size 
p-by-q; and 

at least one of: 

storing elements in at least one of said blocks in at least one of a 
cache and a memory in a format in which elements of said block occupy a 
location different from an original location in said block; and 

storing said blocks of size p-by-q in a memory in a format in which 
at least one said block occupies a position different from its original position in 
said matrix A. 

18. The signal-bearing medium of claim 17, said method further comprising: 

loading said blocks from said memory into a plurality of data registers so 
that a format of data in said data registers comprises a transpose of said matrix A. 

19. The signal-bearing medium of claim 18, wherein said loading comprises a 
loading using a checkerboard technique. 

20. A method of providing a service related to at least one of solving and 
applying a scientific/engineering problem, said method comprising at least one of: 

using a linear algebra software package that computes one or more matrix 
subroutines, wherein said linear algebra software package processes a matrix data 
for a matrix A to separate said matrix A into blocks, each said block having a size 
p-by-q, 

stores said matrix in at least one of a cache and a memory in a format 
different from an original format of matrix A, said different format comprising at 
least one of 

storing elements in at least one of said blocks in at least one of a 
cache and a memory in a format in which elements of said block occupy a 
location different from an original location in said block; and 

storing said blocks of size p-by-q in a format in which at least one 
said block occupies a position different from its original position in matrix A; 

providing a consultation for purpose of solving at least one of a scientific 
problem and an engineering problem using said linear algebra software package; 

transmitting a result of said linear algebra software package on at least one 
of a network, a signal-bearing medium containing machine-readable data 
representing said result, and a printed version representing said result; and 

receiving a result of said linear algebra software package on at least one of 
a network, a signal-bearing medium containing machine-readable data 
representing said result, and a printed version representing said result. 